Natural language processing (NLP) powers a rich set of mobile applications. To support various language understanding tasks, a foundation NLP model is often fine-tuned in a federated, privacy-preserving setting (FL). This process currently relies on at least hundreds of thousands of labeled training samples from mobile clients; yet mobile users often lack the willingness or knowledge to label their data. Such an inadequacy of data labels is known as a few-shot scenario; it has become the key blocker for mobile NLP applications. For the first time, this work investigates federated NLP in the few-shot scenario (FedFSL). By retrofitting algorithmic advances in pseudo labeling and prompt learning, we first establish a training pipeline that delivers competitive accuracy when only 0.05% (fewer than 100) of the training samples are labeled and the rest are unlabeled. To instantiate the workflow, we further present a system, FFNLP, that addresses the high execution cost with novel designs: (1) curriculum pacing, which injects pseudo labels into the training workflow at a rate commensurate with the learning progress; (2) representational diversity, a mechanism for selecting the most learnable data, for which alone pseudo labels will be generated; (3) co-planning of a model's training depth and layer capacity. Together, these designs reduce the training delay, client energy, and network traffic by up to 46.0$\times$, 41.2$\times$ and 3000.0$\times$, respectively. Through algorithm/system co-design, FFNLP demonstrates that FL can apply to challenging settings where most training samples are unlabeled.
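The curriculum-pacing idea above can be sketched as a scheduler whose pseudo-label injection budget tracks recent learning progress. This is an illustrative sketch only; the function names, the confidence threshold, and the progress-based scaling rule are assumptions, not the actual FFNLP mechanism.

```python
# Hypothetical sketch of curriculum pacing: pseudo labels are admitted into
# training at a rate tied to recent learning progress. All names and
# constants (base, boost, threshold) are illustrative, not from the paper.

def pacing_budget(round_idx, val_acc_history, base=8, boost=4):
    """Return how many pseudo-labeled samples to inject this round.

    The budget grows while validation accuracy is still improving
    (learning progress is high) and stays flat once it plateaus.
    """
    if len(val_acc_history) < 2:
        return base
    progress = val_acc_history[-1] - val_acc_history[-2]
    # Scale the injection rate by recent progress, clipped at zero.
    return base + int(boost * max(progress, 0.0) * 100)

def select_pseudo_labels(unlabeled_scores, budget, threshold=0.9):
    """Pick at most `budget` high-confidence samples for pseudo labeling."""
    confident = [(i, s) for i, s in enumerate(unlabeled_scores) if s >= threshold]
    confident.sort(key=lambda t: t[1], reverse=True)
    return [i for i, _ in confident[:budget]]
```

A scheduler like this realizes the "rate commensurate with learning progress" idea: early rounds with steep accuracy gains admit more pseudo labels, while plateaued rounds hold the budget at its base rate.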
Transformer-based pre-trained models have become the de-facto solution for NLP tasks. Fine-tuning such pre-trained models for downstream tasks often requires a tremendous amount of data that is both private and labeled. However, in reality: 1) such private data cannot be collected centrally and is distributed across mobile devices, and 2) well-curated labeled data is scarce. To tackle these issues, we first define a data generator for federated few-shot learning tasks, which captures the quantity and distribution of scarce labeled data in a realistic setting. We then propose AUG-FedPrompt, a prompt-based federated learning algorithm that carefully annotates abundant unlabeled data for data augmentation. AUG-FedPrompt can perform on par with full-set fine-tuning from very little initial labeled data.
Drug discovery is crucial for protecting humans from disease. Target-based screening has been one of the most popular approaches to developing new drugs over the past few decades. This approach efficiently screens candidate drugs that inhibit a target protein in vitro, but it often fails because the selected drugs show insufficient activity in vivo. Accurate computational methods are needed to bridge this gap. Here, we propose a novel graph multi-task deep learning model to identify compounds with both target-inhibitory and cell-active (MATIC) properties. On a carefully curated SARS-CoV-2 dataset, the proposed MATIC model shows advantages over traditional methods in screening compounds that are effective in vivo. We then explore the model's interpretability and find that the features learned for the target-inhibition (in vitro) and cell-activity (in vivo) tasks differ in their molecular-property correlations and atom-level attention. Based on these findings, we employ a Monte Carlo-based reinforcement learning generative model to generate novel multi-property compounds with both in vitro and in vivo efficacy, thereby bridging the gap between target-based and cell-based drug discovery.
Graph mining tasks arise in many application domains, ranging from social networks and transportation to e-commerce, and have been receiving great attention from the theoretical and algorithmic design communities in recent years; there has also been pioneering work employing well-studied Reinforcement Learning (RL) techniques to address graph data mining tasks. However, these graph mining methods and RL models are dispersed across different research areas, which makes it hard to compare them. In this survey, we provide a comprehensive overview of RL and graph mining methods and generalize them into Graph Reinforcement Learning (GRL) as a unified formulation. We further discuss the applications of GRL methods across various domains and summarize their method descriptions, open-source code, and benchmark datasets. Furthermore, we propose important directions and challenges to be addressed in the future. To the best of our knowledge, this is the most recent comprehensive survey of GRL; it provides a global view and a learning resource for scholars. In addition, we maintain an open-source online resource both for interested scholars who want to enter this rapidly developing domain and for experts who would like to compare GRL methods.
Host-based threats such as program attacks, malware implantation, and advanced persistent threats (APTs) are commonly adopted by modern attackers. Recent studies propose leveraging the rich contextual information in data provenance to detect threats on a host. Data provenance is a directed acyclic graph constructed from system audit data: nodes in a provenance graph represent system entities (e.g., $processes$ and $files$), and edges represent system calls in the direction of information flow. However, previous approaches, which extract features of the whole provenance graph, are not sensitive to the small number of threat-related entities and thus perform poorly when hunting stealthy threats. We present ThreaTrace, an anomaly-based detector that detects host-based threats at the system-entity level without prior knowledge of attack patterns. We tailor GraphSAGE, an inductive graph neural network, to learn the role of every benign entity in a provenance graph. ThreaTrace is a real-time system that scales to monitoring long-running hosts and is capable of detecting host-based intrusions at an early stage. We evaluate ThreaTrace on three public datasets. The results show that ThreaTrace outperforms three state-of-the-art host-intrusion detection systems.
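The inductive aggregation that GraphSAGE-style models rely on can be sketched in a few lines: each entity's embedding is updated from the mean of its neighbors' features, so entities unseen during training can still be embedded at detection time. The function name, weights, and toy features below are illustrative assumptions, not ThreaTrace's actual implementation.

```python
# Illustrative mean-aggregator layer in the spirit of GraphSAGE: a node's
# new representation mixes its own features with the mean of its neighbors'.
# Weights and feature dimensions here are toy assumptions.
def sage_mean_layer(features, adjacency, self_w=0.5, neigh_w=0.5):
    """One layer: h_v = self_w * x_v + neigh_w * mean(x_u for u in N(v))."""
    out = {}
    for node, feats in features.items():
        neighbors = adjacency.get(node, [])
        if neighbors:
            agg = [sum(features[n][i] for n in neighbors) / len(neighbors)
                   for i in range(len(feats))]
        else:
            agg = [0.0] * len(feats)
        out[node] = [self_w * x + neigh_w * a for x, a in zip(feats, agg)]
    return out
```

Because the layer is defined over any node's local neighborhood rather than a fixed graph, it can embed new provenance-graph entities as audit events arrive, which is what makes entity-level, real-time detection feasible.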
We study the problem of online learning from human feedback in human-in-the-loop machine translation, in which human translators revise machine-generated translations and the corrected translations are then used to improve the neural machine translation (NMT) system. Previous methods, however, require online model updates or additional translation-memory networks to achieve high-quality performance, making them inflexible and inefficient in practice. In this paper, we propose a novel non-parametric online learning method that does not change the model structure. The approach introduces two k-nearest-neighbor (kNN) modules: one module memorizes the human feedback, i.e., the correct sentences provided by human translators, while the other adaptively balances the use of this historical human feedback against the original NMT model. Experiments on the EMEA and JRC-Acquis benchmarks demonstrate that our proposed method obtains substantial improvements in translation accuracy and achieves better adaptation performance with fewer repeated human correction operations.
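The retrieval-then-interpolation pattern behind such kNN modules can be sketched as follows. This is a minimal sketch under stated assumptions: the datastore layout, the distance-to-probability softmax, and the fixed interpolation weight are illustrative (the abstract's second module balances this weight adaptively rather than fixing it).

```python
# Hedged sketch of a non-parametric kNN module for NMT adaptation: a
# datastore memorizes (hidden-state key, target-token) pairs from
# human-corrected sentences, and at decoding time the retrieved neighbors
# are interpolated with the base NMT distribution. All names and the fixed
# interpolation weight `lam` are illustrative assumptions.
import math

def knn_distribution(query, datastore, k=4, temperature=10.0, vocab_size=8):
    """Turn the k nearest (key, token) entries into a token distribution."""
    nearest = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), tok)
        for key, tok in datastore
    )[:k]
    weights = [math.exp(-d / temperature) for d, _ in nearest]
    z = sum(weights)
    probs = [0.0] * vocab_size
    for (_, tok), w in zip(nearest, weights):
        probs[tok] += w / z
    return probs

def interpolate(p_model, p_knn, lam=0.5):
    """Fixed-weight mixture; an adaptive module would tune lam per step."""
    return [lam * pk + (1 - lam) * pm for pm, pk in zip(p_model, p_knn)]
```

Because adaptation happens purely through the datastore, incorporating a new human correction is an append, not a gradient update, which is what makes the method non-parametric and leaves the NMT model untouched.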
Graph pre-training strategies have been attracting attention in the graph mining community because of their flexibility in parameterizing graph neural networks (GNNs) without any label information. The key idea lies in encoding valuable information by predicting masked graph signals extracted from the input graphs. To balance the importance of diverse graph signals (e.g., nodes, edges, subgraphs), existing methods are largely hand-engineered, re-weighting graph signals via introduced hyperparameters. However, human intervention with suboptimal hyperparameters often injects additional bias and degrades generalization performance in downstream applications. This paper addresses these limitations from a new perspective: providing a curriculum for pre-training GNNs. We propose an end-to-end model named MentorGNN that aims to supervise the pre-training process of GNNs across graphs with diverse structures and disparate feature spaces. To comprehend heterogeneous graph signals at different granularities, we propose a curriculum learning paradigm that automatically re-weights graph signals to ensure good generalization on the target domain. Moreover, we shed new light on the problem of domain adaptation on relational data (i.e., graphs) by deriving a natural and interpretable upper bound on the generalization error of the pre-trained GNNs. Extensive experiments on a wealth of real graphs validate the performance of MentorGNN.
Directly motivated by security-related applications in the Homeland Security Enterprise, we focus on privacy-preserving analysis of graph data, which provides the critical capability to represent rich attributes and relationships. In particular, we discuss two directions, privacy-preserving graph analysis and federated graph learning, which together can enable collaboration among multiple parties each holding private graph data. For each direction, we identify both "quick wins" and "hard problems." Finally, we demonstrate a user interface that facilitates model explanation, interpretation, and visualization. We believe the techniques developed in these directions will significantly enhance the Homeland Security Enterprise's capabilities to tackle and mitigate a variety of security risks.
Disinformation refers to false information deliberately spread to influence the general public, and its negative impact on society can be observed in numerous issues, such as political agendas and the manipulation of financial markets. In this paper, we identify prevalent challenges and advances related to automated disinformation detection from multiple aspects, and propose a comprehensive and explainable disinformation detection framework called DISCO. It leverages the heterogeneity of disinformation and addresses the opacity of predictions. We then demonstrate DISCO on a real-world fake news detection task with satisfactory detection accuracy and explanations. The demo video and source code of DISCO are now publicly available. We hope our demo can pave the way for addressing the limitations of identification, comprehension, and explainability in a holistic manner.
Shallow GNNs tend to perform poorly on large graphs or graphs with missing features. It is therefore necessary to increase the depth (i.e., the number of layers) of a GNN to capture more latent knowledge from the input data. On the other hand, adding more layers to a GNN typically degrades its performance through, e.g., vanishing gradients and over-smoothing. Existing methods (e.g., PairNorm and DropEdge) mainly focus on addressing over-smoothing, but they suffer from drawbacks such as requiring hard-to-obtain prior knowledge or introducing large training randomness. Moreover, these methods simply apply residual connections to address vanishing gradients. They ignore an important fact: as the number of layers grows, information gathered from distant neighbors becomes dominant over that gathered from 1-hop and 2-hop neighbors, leading to severe performance degradation. In this paper, we first take a deep dive into the ResNet architecture and analyze why ResNet is ill-suited to deeper GNNs. We then propose a new residual architecture to mitigate the negative effects ResNet causes. To address the drawbacks of the existing methods, we introduce a topology-guided graph contrastive loss named TGCL. It leverages node topology information and, via contrastive-learning regularization, pulls connected node pairs closer together to obtain discriminative node representations. Combining the new residual architecture with TGCL, we propose an end-to-end framework named DeeperGXX. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeeperGXX compared with state-of-the-art baselines.
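A topology-guided contrastive regularizer of the kind described above can be sketched as an InfoNCE-style loss in which each connected node pair is a positive and sampled non-neighbors are negatives. This is a minimal sketch under stated assumptions: the similarity function, temperature, and negative-sampling scheme are illustrative, and the paper's exact TGCL formulation may differ.

```python
# Assumption-laden sketch of a topology-guided contrastive loss: connected
# node pairs act as positives and sampled non-neighbors as negatives, so the
# regularizer pulls neighbors together in embedding space.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tgcl_like_loss(embeddings, edges, negatives, tau=0.5):
    """InfoNCE-style loss: for each edge (u, v), score the connected pair
    against sampled non-neighbor pairs of u."""
    loss = 0.0
    for (u, v), negs in zip(edges, negatives):
        pos = math.exp(cosine(embeddings[u], embeddings[v]) / tau)
        den = pos + sum(
            math.exp(cosine(embeddings[u], embeddings[n]) / tau) for n in negs
        )
        loss += -math.log(pos / den)
    return loss / len(edges)
```

Minimizing this loss rewards embeddings in which neighbors are more similar than non-neighbors, which counteracts the tendency of very deep stacks to let distant-neighbor information wash out local structure.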